Automatically Extracting Subsequent Response Pages from Web Search Sources

Authors

  • Dheerendranath Mundluru
  • Zonghuan Wu
  • Vijay Raghavan
  • Weiyi Meng
  • Hongkun Zhao

Abstract

When Web search sources such as search engines and deep Web sites retrieve too many result records for a given query, they usually split them across several pages of, say, ten or twenty records each and return only the page containing the top-ranked records. This page usually provides one or more hyperlinks or buttons pointing to one or more of the remaining response pages (called subsequent response pages), which in turn contain similar hyperlinks or buttons that let users navigate from one page to another. Information integration systems often need to access these subsequent response pages to extract the records they contain. However, different Web search sources often display the hyperlinks or buttons pointing to subsequent response pages in different formats, which makes it challenging to identify these hyperlinks or buttons automatically and to extract the response pages they reference. In this paper, we propose a novel solution to automatically fetch any specified response page from autonomous and heterogeneous Web search sources for any given query. Our approach first identifies certain important hyperlinks present in a response page sampled from an input Web search source and then analyzes them further using four heuristics. Finally, a wrapper is built to automatically extract any specified response page from the input source.
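
As a rough illustration of the first step described in the abstract (picking out hyperlinks that may point to subsequent response pages), the following Python sketch collects the anchors on a page and keeps those whose text looks like a page number or a "next"-style label. The abstract does not specify the four heuristics, so the pattern `PAGING_TEXT`, the class `AnchorCollector`, and the function `candidate_paging_links` are hypothetical names and assumptions made for illustration, not the authors' method.

```python
from html.parser import HTMLParser
import re


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Assumed pattern (not from the paper): anchor text that is a bare page
# number or a common "next page" label.
PAGING_TEXT = re.compile(
    r"^(\d+|next(\s+page)?|more(\s+results)?|>|>>|»)$", re.IGNORECASE
)


def candidate_paging_links(html):
    """Return the (href, text) pairs whose anchor text matches PAGING_TEXT."""
    collector = AnchorCollector()
    collector.feed(html)
    return [(href, text) for href, text in collector.links
            if PAGING_TEXT.match(text)]
```

A real system would need more than anchor text: buttons, image links, and sources that label their paging controls differently are exactly the heterogeneity the abstract refers to, and they fall outside this sketch.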

Similar resources

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Reputation Extraction Using Both Structural and Content Information

We propose a new method of extracting texts related to a given keyword from Web pages collected by a search engine. By combining structural pattern matching and text classification, texts related to a given keyword, such as reputations of a given restaurant, can be extracted automatically from Web pages in unfixed sites, which is impossible with conventional wrappers. According to our cross validat...

Information Extraction from Hypertext Mark-Up Language Web Pages

Problem statement: Nowadays, many users rely on web search engines to find and gather information. Users face an increasing number of varied HTML information sources, so correlating, integrating and presenting related information to users becomes important. When a user uses a search engine such as Yahoo or Google to seek specific information, the results are not only information about ...

Dynamic Vision-Based Approach in Web Data Extraction

The problem is that of extracting data records from the response pages returned by web databases or search engines. The World Wide Web has posed a challenging problem in extracting relevant data. Traditional web crawlers focus only on the surface web while the deep web keeps expanding behind the scenes. Deep web pages are created dynamically as a result of queries posed to specific web databases. Extracting...

Web Page Categorization Using Artificial Neural Networks

Web page categorization is one of the challenging tasks in the world of ever-increasing web technologies. There are many ways of categorizing web pages, based on different approaches and features. This paper proposes a new dimension in the categorization of web pages, using an artificial neural network (ANN) and extracting the features automatically. Here eight major categories of web ...

Publication date: 2005